1 Introduction

1.1 What is the SAM file format?

From Wikipedia, the free encyclopedia:

Sequence Alignment Map (SAM) is a text-based format originally for storing biological sequences aligned to a reference sequence developed by Heng Li and Bob Handsaker et al.[1] It was developed when the 1000 Genomes Project wanted to move away from the MAQ mapper format and decided to design a new format. The overall TAB-delimited flavour of the format came from an earlier format inspired by BLAT’s PSL. The name of SAM came from Gabor Marth from University of Utah, who originally had a format under the same name but with a different syntax more similar to a BLAST output.[2] It is widely used for storing data, such as nucleotide sequences, generated by next generation sequencing technologies, and the standard has been broadened to include unmapped sequences.[3] The format supports short and long reads (up to 128 Mbp) produced by different sequencing platforms and is used to hold mapped data within the Genome Analysis Toolkit (GATK) and across the Broad Institute, the Wellcome Sanger Institute, and throughout the 1000 Genomes Project.

2 What is the Samtools software?

From Wikipedia, the free encyclopedia:

SAMtools is a set of utilities for interacting with and post-processing short DNA sequence read alignments in the SAM (Sequence Alignment/Map), BAM (Binary Alignment/Map) and CRAM formats, written by Heng Li. These files are generated as output by short read aligners like BWA. Both simple and advanced tools are provided, supporting complex tasks like variant calling and alignment viewing as well as sorting, indexing, data extraction and format conversion.[3] SAM files can be very large (10s of Gigabytes is common), so compression is used to save space. SAM files are human-readable text files, and BAM files are simply their binary equivalent, whilst CRAM files are a restructured column-oriented binary container format. BAM files are typically compressed and more efficient for software to work with than SAM. SAMtools makes it possible to work directly with a compressed BAM file, without having to uncompress the whole file. Additionally, since the format for a SAM/BAM file is somewhat complex - containing reads, references, alignments, quality information, and user-specified annotations - SAMtools reduces the effort needed to use SAM/BAM files by hiding low-level details.

4 Tutorial

The first thing to do is create a directory to store all the tutorial output files and data:

bash
mkdir tutorial

4.1 Software installation

Create a new environment called ‘alignment’ with all the required software installed:

bash
conda create --yes --name alignment bowtie2 bwa entrez-direct samtools seqkit sra-tools=2.11.0=pl5262h37d2149_1 # only this version works currently (21/09/2022)
## Collecting package metadata (current_repodata.json): ...working... done
## Solving environment: ...working... failed with repodata from current_repodata.json, will retry with next repodata source.
## Collecting package metadata (repodata.json): ...working... done
## Solving environment: ...working... done
## 
## ## Package Plan ##
## 
##   environment location: /opt/miniconda3/envs/alignment
## 
##   added / updated specs:
##     - bowtie2
##     - bwa
##     - entrez-direct
##     - samtools
##     - seqkit
##     - sra-tools==2.11.0=pl5262h37d2149_1
## 
## 
## The following NEW packages will be INSTALLED:
## 
##   bowtie2            bioconda/osx-64::bowtie2-2.4.5-py310h7d4de36_4
##   bwa                bioconda/osx-64::bwa-0.7.17-h1f540d2_9
##   bzip2              conda-forge/osx-64::bzip2-1.0.8-h0d85af4_4
##   c-ares             conda-forge/osx-64::c-ares-1.18.1-h0d85af4_0
##   ca-certificates    conda-forge/osx-64::ca-certificates-2022.9.14-h033912b_0
##   curl               conda-forge/osx-64::curl-7.83.1-h23f1065_0
##   entrez-direct      bioconda/osx-64::entrez-direct-16.2-h193322a_1
##   gettext            conda-forge/osx-64::gettext-0.19.8.1-hd1a6beb_1008
##   hdf5               conda-forge/osx-64::hdf5-1.10.6-nompi_haae91d6_101
##   htslib             bioconda/osx-64::htslib-1.16-h567f53e_0
##   icu                conda-forge/osx-64::icu-70.1-h96cf925_0
##   krb5               conda-forge/osx-64::krb5-1.19.3-hb98e516_0
##   libcurl            conda-forge/osx-64::libcurl-7.83.1-h23f1065_0
##   libcxx             conda-forge/osx-64::libcxx-14.0.6-hccf4f1f_0
##   libdeflate         conda-forge/osx-64::libdeflate-1.13-h775f41a_0
##   libedit            conda-forge/osx-64::libedit-3.1.20191231-h0678c8f_2
##   libev              conda-forge/osx-64::libev-4.33-haf1e3a3_1
##   libffi             conda-forge/osx-64::libffi-3.4.2-h0d85af4_5
##   libgfortran        conda-forge/osx-64::libgfortran-4.0.0-7_5_0_h1a10cd1_23
##   libgfortran4       conda-forge/osx-64::libgfortran4-7.5.0-h1a10cd1_23
##   libiconv           conda-forge/osx-64::libiconv-1.16-haf1e3a3_0
##   libidn2            conda-forge/osx-64::libidn2-2.3.3-hac89ed1_0
##   libnghttp2         conda-forge/osx-64::libnghttp2-1.47.0-h5aae05b_1
##   libsqlite          conda-forge/osx-64::libsqlite-3.39.3-ha978bb4_0
##   libssh2            conda-forge/osx-64::libssh2-1.10.0-h47af595_3
##   libunistring       conda-forge/osx-64::libunistring-0.9.10-h0d85af4_0
##   libxml2            conda-forge/osx-64::libxml2-2.9.14-hea49891_4
##   libzlib            conda-forge/osx-64::libzlib-1.2.12-hfd90126_3
##   llvm-openmp        conda-forge/osx-64::llvm-openmp-14.0.4-ha654fa7_0
##   ncbi-ngs-sdk       bioconda/osx-64::ncbi-ngs-sdk-2.11.2-h247ad82_0
##   ncurses            conda-forge/osx-64::ncurses-6.3-h96cf925_1
##   openssl            conda-forge/osx-64::openssl-3.0.5-hfd90126_2
##   ossuuid            conda-forge/osx-64::ossuuid-1.6.2-h0a44026_1000
##   perl               conda-forge/osx-64::perl-5.26.2-hbcb3906_1008
##   perl-uri           bioconda/osx-64::perl-uri-1.71-pl526_3
##   perl-xml-libxml    bioconda/osx-64::perl-xml-libxml-2.0132-pl526h08abf6f_1
##   perl-xml-namespac~ bioconda/osx-64::perl-xml-namespacesupport-1.11-pl526_1
##   perl-xml-sax       bioconda/osx-64::perl-xml-sax-0.99-pl526_1
##   perl-xml-sax-base  bioconda/osx-64::perl-xml-sax-base-1.09-pl526_0
##   pip                conda-forge/noarch::pip-22.2.2-pyhd8ed1ab_0
##   python             conda-forge/osx-64::python-3.10.6-hc14f532_0_cpython
##   python_abi         conda-forge/osx-64::python_abi-3.10-2_cp310
##   readline           conda-forge/osx-64::readline-8.1.2-h3899abd_0
##   samtools           bioconda/osx-64::samtools-1.15.1-h7e39424_1
##   seqkit             bioconda/osx-64::seqkit-2.3.1-h527b516_0
##   setuptools         conda-forge/noarch::setuptools-65.3.0-pyhd8ed1ab_1
##   sra-tools          bioconda/osx-64::sra-tools-2.11.0-pl5262h37d2149_1
##   tbb                conda-forge/osx-64::tbb-2021.5.0-hb8565cd_5
##   tk                 conda-forge/osx-64::tk-8.6.12-h5dbffcc_0
##   tzdata             conda-forge/noarch::tzdata-2022c-h191b570_0
##   wget               conda-forge/osx-64::wget-1.20.3-hd3787cc_1
##   wheel              conda-forge/noarch::wheel-0.37.1-pyhd8ed1ab_0
##   xz                 conda-forge/osx-64::xz-5.2.6-h775f41a_0
##   zlib               conda-forge/osx-64::zlib-1.2.12-hfd90126_3
##   zstd               conda-forge/osx-64::zstd-1.5.2-hfa58983_4
## 
## 
## Preparing transaction: ...working... done
## Verifying transaction: ...working... done
## Executing transaction: ...working... done
## #
## # To activate this environment, use
## #
## #     $ conda activate alignment
## #
## # To deactivate an active environment, use
## #
## #     $ conda deactivate
## 
## Retrieving notices: ...working... done

Activate the ‘alignment’ environment:

bash
conda activate alignment

Test that the installed software, such as samtools, is available:

bash
which samtools
## /opt/miniconda3/envs/alignment/bin/samtools

4.2 Data preparation

Download the Ebola virus reference genome:

bash
efetch -db nuccore -id AF086833 -format fasta > tutorial/AF086833.fasta

Download 1,000 sequencing reads from public data:

bash
fastq-dump -X 1000 -O tutorial --split-files SRR1972739
## Read 1000 spots for SRR1972739
## Written 1000 spots for SRR1972739

Index the reference genome:

bash
bwa index -p tutorial/AF086833 tutorial/AF086833.fasta
## [bwa_index] Pack FASTA... 0.00 sec
## [bwa_index] Construct BWT for the packed sequence...
## [bwa_index] 0.00 seconds elapse.
## [bwa_index] Update BWT... 0.00 sec
## [bwa_index] Pack forward-only FASTA... 0.00 sec
## [bwa_index] Construct SA from BWT and Occ... 0.00 sec
## [main] Version: 0.7.17-r1188
## [main] CMD: bwa index -p tutorial/AF086833 tutorial/AF086833.fasta
## [main] Real time: 0.005 sec; CPU: 0.023 sec

Align the reads to the reference genome using BWA:

bash
bwa mem tutorial/AF086833 tutorial/SRR1972739_1.fastq tutorial/SRR1972739_2.fastq > tutorial/SRR1972739.sam
## [M::bwa_idx_load_from_disk] read 0 ALT contigs
## [M::process] read 2000 sequences (202000 bp)...
## [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (63, 590, 2, 68)
## [M::mem_pestat] analyzing insert size distribution for orientation FF...
## [M::mem_pestat] (25, 50, 75) percentile: (85, 119, 183)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 379)
## [M::mem_pestat] mean and std.dev: (135.27, 62.49)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 477)
## [M::mem_pestat] analyzing insert size distribution for orientation FR...
## [M::mem_pestat] (25, 50, 75) percentile: (176, 219, 275)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 473)
## [M::mem_pestat] mean and std.dev: (227.64, 75.24)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 572)
## [M::mem_pestat] skip orientation RF as there are not enough pairs
## [M::mem_pestat] analyzing insert size distribution for orientation RR...
## [M::mem_pestat] (25, 50, 75) percentile: (71, 147, 192)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 434)
## [M::mem_pestat] mean and std.dev: (149.31, 89.10)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 555)
## [M::mem_process_seqs] Processed 2000 reads in 0.107 CPU sec, 0.107 real sec
## [main] Version: 0.7.17-r1188
## [main] CMD: bwa mem tutorial/AF086833 tutorial/SRR1972739_1.fastq tutorial/SRR1972739_2.fastq
## [main] Real time: 0.110 sec; CPU: 0.114 sec

4.3 SAM reading/writing

The SAM format is used to store output from read alignment software. It is a TAB-delimited text format consisting of a header section and an alignment section. To look at each of these sections, we can use the view command.
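Because both sections are plain TAB-delimited text, standard Unix tools can pull out individual fields. A minimal sketch, using a shortened, made-up record rather than real tutorial output:

```shell
# Extract the read name (field 1), reference name (field 3), alignment
# position (field 4), and mapping quality (field 5) from a SAM record.
# The record below is a shortened, made-up example, not tutorial output.
printf 'read1\t99\tAF086833.2\t15600\t60\t101M\t=\t15684\t153\tACGT\tFFFF\n' \
  | awk -F'\t' '{ print $1, $3, $4, $5 }'
## read1 AF086833.2 15600 60
```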

Print the SAM header section:

bash
samtools view -H tutorial/SRR1972739.sam
## @SQ  SN:AF086833.2   LN:18959
## @PG  ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:bwa mem tutorial/AF086833 tutorial/SRR1972739_1.fastq tutorial/SRR1972739_2.fastq
## @PG  ID:samtools PN:samtools PP:bwa  VN:1.15.1   CL:samtools view -H tutorial/SRR1972739.sam

The header section primarily contains metadata about the file and the reference sequence, such as whether the alignments in the file have been sorted. Refer to the SAM specification for full details.

Print the SAM alignment section:

bash
# Output the first few lines
samtools view tutorial/SRR1972739.sam | head
## SRR1972739.1 83  AF086833.2  15684   60  69M32S  =   15600   -153    TTTAGATTTAACAAGATACCGAGAAAATGAATTGATTTATGACAATAATCCTCTAAAAGGAGGACTCAAATGAGATATTGCAATTGAGTCCTCCTTTTAGA   DDDDDEEEEEDEEFFFEDHHHHIJJJJJJJJJJJJJJJJJIGIJJJJJJJJJJJJJJJJJJJIGIGJJJJJJJJJJJJJJIJJJJJJIFHHHHFFFFFCCC   NM:i:2  MD:Z:27C16G24   MC:Z:101M   AS:i:59 XS:i:0  SA:Z:AF086833.2,15735,+,33M68S,60,0;
## SRR1972739.1 2115    AF086833.2  15735   60  33M68H  =   15600   -136    TCTAAAAGGAGGACTCAATTGCAATATCTCATT   CCCFFFFFHHHHFIJJJJJJIJJJJJJJJJJJJ   NM:i:0  MD:Z:33 MC:Z:101M   AS:i:33 XS:i:0  SA:Z:AF086833.2,15684,-,69M32S,60,2;
## SRR1972739.1 163 AF086833.2  15600   60  101M    =   15684   153 GTATAATCGTGCTCACCTTCATCTAACTAAGTGTTGCACCCGGGAGGTACCAGCTCAGTACTTAACATACACATCTACATTGGATTTAGATTTAACAAGAT   @BBFFFFFHHHHHJJJJJJJJJJJJJIIIJJGHIIJJJJJJJJIJJJGIJGIJJIJJJIJHHHHHHGFFFFFEEEEEEEDFDDDDDDDDDDEDDEDDDDDD   NM:i:3  MD:Z:0A44A14T40 MC:Z:69M32S AS:i:90 XS:i:0
## SRR1972739.2 115 AF086833.2  4919    60  101M    =   4863    -104    CAGGCTCCTGCGAATTGGAAACCAGGCTTTCCTCCAGGAGTTCGTTCTTCCACCAGTCCAACTACCCCAGTATTTCACCTTTGATTTGACAGCACTCAAAC   DDDDDDDDDDDDDDDEDDDDCDDEEEEEFFFHHHHHIIIJJJJJJJJJIJHGJJJJIJJIIIHFJJJJJJJJJIHGFJJJJJJJJJJIHHHGHFFFFFBBB   NM:i:1  MD:Z:51G49  MC:Z:47S54M AS:i:96 XS:i:0
## SRR1972739.2 179 AF086833.2  4863    60  47S54M  =   4919    104 TGGTTTCCAATTCGCAGGAGCCTGAGGGGGTGATCCGGGATTCCAGGACCAATCCGCTTGTCAGAGTCAATCGGCTGGGTCCTGGAATCCCGGATCACCCC   BDDDDDDDDDDDDCDDDBDDDDDDDDBDDDDDDDDBDDDEDEECEFFDFFHHGJJIFJJJJJIJJJIJJJHIJJJJJJIHFBJJIHIJHGHHHEDD?F@@@   NM:i:2  MD:Z:8A41T3 MC:Z:101M   AS:i:45 XS:i:0  SA:Z:AF086833.2,4893,+,51S50M,60,1;
## SRR1972739.2 2211    AF086833.2  4893    60  51H50M  =   4919    127 GGTCCTGGAATCCCGGATCACCCCCTCAGGCTCCTGCGAATTGGAAACCA  FFDFFECEEDEDDDBDDDDDDDDBDDDDDDDDBDDDCDDDDDDDDDDDDB  NM:i:1  MD:Z:20T29  MC:Z:101M   AS:i:45 XS:i:0  SA:Z:AF086833.2,4863,-,47S54M,60,2;
## SRR1972739.3 67  AF086833.2  2531    60  60M41S  =   2587    57  TATCCGGACTCCCTTGAAGAGGAATATCCACCATGGCTCACTGAAAAAGAGGCCATGAATTTATTCCTGTGATTCATTACTGGCCAATAAAATTGTTGACC   CCCFFFFFHHHHHJJJIIIIJJJJJJJJJJJJJJJJJJJJJIJJIIJJJJFHIJIJJJJJJJHGHGHHHFFFFFFFEFEEEEEDDDDDDDDDDCDCDDDDD   NM:i:2  MD:Z:5A47T6 MC:Z:81M20S AS:i:50 XS:i:0  SA:Z:AF086833.2,2618,-,50M51S,44,3;
## SRR1972739.3 2131    AF086833.2  2618    44  50M51H  =   2587    -81 GGTCAACAATTTTATTGGCCAGTAATGAATCACAGGAATAAATTCATGGC  DDDDDCDCDDDDDDDDDDEEEEEFEFFFFFFFHHHGHGHJJJJJJJIJIH  NM:i:3  MD:Z:20G2G10A15 MC:Z:81M20H AS:i:35 XS:i:0  SA:Z:AF086833.2,2531,+,60M41S,60,2;
## SRR1972739.3 131 AF086833.2  2587    60  81M20S  =   2531    -57 GAATGATGAGAATAGATTTGTTACACTGGATGGTCAACAATTTTATTGGCCAGTAATGAATCACAGGAATAAATTCATGGCCTCTTTTTCAGTGAGCCATG   CC@FFFFFHHHHHJJJJJJJJJJJJJJJJJJJJGIJJJJJJJJJJJJJIJJJJHHJIHJJIGIJIIJIJIJJJJJIIJIGHHHHFFFFFCCEEECCEDDC@   NM:i:5  MD:Z:6A18T25G2G10A15    MC:Z:60M41S AS:i:56 XS:i:0
## SRR1972739.4 77  *   0   0   *   *   0   0   CTCTAATGAATCCCTTAAGCCAGGAGCGTTAGTGTCAAACGCAACGCCCCGGGTTCATGATCCTGGATGGCTGGTCGAACCAGGGAGATGTCACTCTAATA   CCCFFFFFHHHHHJJJJJJJJJIGIJJJIJJJJIGIJJJJHIJJJJIIJJHFFCDEEEDDDDDDDDDDDDDDDBCDDDDDDDDDDDDDDDEDEDDDDDDED   AS:i:0  XS:i:0
## samtools view: writing to standard output failed: Broken pipe
## samtools view: error closing standard output: -1

The alignment section contains data about the input sequence alignments. Each line typically represents a single alignment and records how it was generated, for example which reference sequence the read was aligned to and at what position. Again, refer to the SAM specification for full details.

It’s important to note that we rarely work with alignments in this text-based format. Usually, we convert the SAM file into a binary or compressed representation which we refer to as the BAM format. This allows output from large sequencing runs to be stored efficiently. It also helps if we sort the alignments by their mapping position as this allows us to produce what’s called an ‘index’ file. As the name suggests, this file is analogous to the index in a book. It tells us which part of the file to read to fetch alignments from a particular region of the reference sequence. To sort the alignments and output a BAM file, we can use the sort command.

Sort and convert from SAM to BAM format:

bash
samtools sort tutorial/SRR1972739.sam > tutorial/SRR1972739.bam

You should be aware that older versions of samtools required two commands to achieve the same outcome:

bash
samtools view -b tutorial/SRR1972739.sam > tutorial/SRR1972739.bam
samtools sort tutorial/SRR1972739.bam > tutorial/SRR1972739.sorted.bam

Once we have the sorted BAM file, we can then create an index file. We do this using the index command. Notably, we do not have to specify an output file name for this command. The index file is automatically created and named by appending .bai to the input file name.

Build an index for the BAM file:

bash
samtools index tutorial/SRR1972739.bam

List the BAM and index file:

bash
ls tutorial/SRR1972739.*
## tutorial/SRR1972739.bam
## tutorial/SRR1972739.bam.bai
## tutorial/SRR1972739.sam

4.4 Alignment statistics

It’s often useful to calculate various statistics on alignment files. These statistics can be used to answer a range of questions, some needed for basic quality control and others which have biological significance. We will go through the most common statistics which can be calculated by the samtools commands. Alignment summary statistics can be reported using the idxstats command. The output is TAB-delimited with each line consisting of reference sequence name, sequence length, number of mapped reads, and number of unmapped reads.

bash
samtools idxstats tutorial/SRR1972739.bam
## AF086833.2   18959   1549    0
## *    0   0   524
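Because the idxstats output is TAB-delimited, it is straightforward to post-process. A sketch that sums the mapped and unmapped counts with awk; the input mirrors the listing above and is embedded via printf so the example is self-contained:

```shell
# Sum the mapped (column 3) and unmapped (column 4) read counts from
# idxstats-style output. In practice, replace the printf with:
#   samtools idxstats tutorial/SRR1972739.bam
printf 'AF086833.2\t18959\t1549\t0\n*\t0\t0\t524\n' \
  | awk -F'\t' '{ mapped += $3; unmapped += $4 }
                END { print "mapped:", mapped, "unmapped:", unmapped }'
## mapped: 1549 unmapped: 524
```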

In addition to reporting the number of alignments by reference sequence, we can use the flagstat command to report the number of alignments for each FLAG value as described in the SAM format specification.

bash
samtools flagstat tutorial/SRR1972739.bam
## 2073 + 0 in total (QC-passed reads + QC-failed reads)
## 2000 + 0 primary
## 0 + 0 secondary
## 73 + 0 supplementary
## 0 + 0 duplicates
## 0 + 0 primary duplicates
## 1549 + 0 mapped (74.72% : N/A)
## 1476 + 0 primary mapped (73.80% : N/A)
## 2000 + 0 paired in sequencing
## 1000 + 0 read1
## 1000 + 0 read2
## 1466 + 0 properly paired (73.30% : N/A)
## 1476 + 0 with itself and mate mapped
## 0 + 0 singletons (0.00% : N/A)
## 0 + 0 with mate mapped to a different chr
## 0 + 0 with mate mapped to a different chr (mapQ>=5)
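The flagstat text itself can be mined with standard tools when you only need a single number, such as the primary mapping rate. A sketch, with two lines mirrored from the output above embedded as self-contained input:

```shell
# Extract the primary mapped percentage from flagstat-style text.
# In practice, pipe `samtools flagstat file.bam` into awk instead of printf.
printf '2000 + 0 primary\n1476 + 0 primary mapped (73.80%% : N/A)\n' \
  | awk '/primary mapped \(/ { gsub(/[(%]/, "", $6); print $6 }'
## 73.80
```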

The flagstat report is helpful for assessing overall alignment performance. For example, a low mapping rate may indicate a significant problem with the library and will require further investigation. If you ever find yourself confused about what a FLAG value represents, you can use the flags command to convert between the value and its description.

Convert a description to its FLAG value:

bash
samtools flags PAIRED
## 0x1  1   PAIRED

Convert a FLAG value to its description:

bash
samtools flags 1
## 0x1  1   PAIRED

In many genomics applications it is important to compute both read coverage and read depth at each position or region:

  • Read coverage is defined as the proportion of a region which is covered by alignments. For example, a set of alignments might cover only 10% of the full reference sequence.
  • Read depth is defined as the number of alignments that cover a particular position. For example, the 1st base in the reference sequence may be covered by 100 alignments.
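These two definitions can be made concrete with a toy calculation. Assuming a reference of length 4 and made-up per-position depths (reference, position, depth, in the same shape as samtools depth output), coverage is the fraction of positions with non-zero depth, while the meandepth reported by samtools coverage is the total aligned depth divided by the region length:

```shell
# Toy coverage/depth calculation on made-up data. Positions 1, 2 and 4
# are covered (depth > 0), so coverage = 3/4; total depth is 10 over a
# length-4 reference, so mean depth = 10/4.
printf 'ref\t1\t2\nref\t2\t3\nref\t3\t0\nref\t4\t5\n' \
  | awk -v reflen=4 '$3 > 0 { covered++ } { total += $3 }
        END { printf "coverage=%.2f meandepth=%.2f\n", covered/reflen, total/reflen }'
## coverage=0.75 meandepth=2.50
```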

To calculate read coverage, we can use the coverage command. The output contains some useful values like the number of covered bases (covbases), the percentage of covered bases (coverage), and the mean depth of coverage (meandepth).

bash
samtools coverage tutorial/SRR1972739.bam
## #rname   startpos    endpos  numreads    covbases    coverage    meandepth   meanbaseq   meanmapq
## AF086833.2   1   18959   1549    18545   97.8163 7.74941 38  59.9

To calculate read depth, we can use the depth command. The output is TAB-delimited with each line consisting of reference sequence name, coordinate, and number of mapped reads or alignments.

bash
# Output the first few lines
samtools depth tutorial/SRR1972739.bam | head
## AF086833.2   114 1
## AF086833.2   115 1
## AF086833.2   116 1
## AF086833.2   117 1
## AF086833.2   118 1
## AF086833.2   119 1
## AF086833.2   120 1
## AF086833.2   121 1
## AF086833.2   122 1
## AF086833.2   123 2
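A common QC use of the depth listing is to flag positions that fall below a minimum depth. A self-contained sketch with made-up depths (in practice, pipe samtools depth into awk; note that by default positions with zero depth are omitted from the output):

```shell
# Report positions whose depth (column 3) is below a threshold.
# Input mimics `samtools depth` output; the values are made up.
printf 'ref\t10\t8\nref\t11\t2\nref\t12\t9\nref\t13\t1\n' \
  | awk -v min=5 '$3 < min { print $2 }'
## 11
## 13
```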

4.5 Alignment filtering

For many genomics applications, the SAM file produced by most read alignment software will require some additional processing. You might have to remove unmapped reads, select only high-quality alignments, or use just a subset of regions. All of these tasks and more can be accomplished using the view command. There are a number of options available in this command, so it is worth examining the help information first:

bash
samtools view --help
## 
## Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]
## 
## Output options:
##   -b, --bam                  Output BAM
##   -C, --cram                 Output CRAM (requires -T)
##   -1, --fast                 Use fast BAM compression (and default to --bam)
##   -u, --uncompressed         Uncompressed BAM output (and default to --bam)
##   -h, --with-header          Include header in SAM output
##   -H, --header-only          Print SAM header only (no alignments)
##       --no-header            Print SAM alignment records only [default]
##   -c, --count                Print only the count of matching records
##   -o, --output FILE          Write output to FILE [standard output]
##   -U, --unoutput FILE, --output-unselected FILE
##                              Output reads not selected by filters to FILE
##   -p, --unmap                Set flag to UNMAP on reads not selected
##                              then write to output file.
##   -P, --fetch-pairs          Retrieve complete pairs even when outside of region
## Input options:
##   -t, --fai-reference FILE   FILE listing reference names and lengths
##   -M, --use-index            Use index and multi-region iterator for regions
##       --region[s]-file FILE  Use index to include only reads overlapping FILE
##   -X, --customized-index     Expect extra index file argument after <in.bam>
## 
## Filtering options (Only include in output reads that...):
##   -L, --target[s]-file FILE  ...overlap (BED) regions in FILE
##   -r, --read-group STR       ...are in read group STR
##   -R, --read-group-file FILE ...are in a read group listed in FILE
##   -N, --qname-file FILE      ...whose read name is listed in FILE
##   -d, --tag STR1[:STR2]      ...have a tag STR1 (with associated value STR2)
##   -D, --tag-file STR:FILE    ...have a tag STR whose value is listed in FILE
##   -q, --min-MQ INT           ...have mapping quality >= INT
##   -l, --library STR          ...are in library STR
##   -m, --min-qlen INT         ...cover >= INT query bases (as measured via CIGAR)
##   -e, --expr STR             ...match the filter expression STR
##   -f, --require-flags FLAG   ...have all of the FLAGs present
##   -F, --excl[ude]-flags FLAG ...have none of the FLAGs present
##       --rf, --incl-flags, --include-flags FLAG
##                              ...have some of the FLAGs present
##   -G FLAG                    EXCLUDE reads with all of the FLAGs present
##       --subsample FLOAT      Keep only FLOAT fraction of templates/read pairs
##       --subsample-seed INT   Influence WHICH reads are kept in subsampling [0]
##   -s INT.FRAC                Same as --subsample 0.FRAC --subsample-seed INT
## 
## Processing options:
##       --add-flags FLAG       Add FLAGs to reads
##       --remove-flags FLAG    Remove FLAGs from reads
##   -x, --remove-tag STR
##                Comma-separated read tags to strip (repeatable) [null]
##       --keep-tag STR
##                Comma-separated read tags to preserve (repeatable) [null].
##                Equivalent to "-x ^STR"
##   -B, --remove-B             Collapse the backward CIGAR operation
## 
## General options:
##   -?, --help   Print long help, including note about region specification
##   -S           Ignored (input format is auto-detected)
##       --no-PG  Do not add a PG line
##       --input-fmt-option OPT[=VAL]
##                Specify a single input file format option in the form
##                of OPTION or OPTION=VALUE
##   -O, --output-fmt FORMAT[,OPT[=VAL]]...
##                Specify output format (SAM, BAM, CRAM)
##       --output-fmt-option OPT[=VAL]
##                Specify a single output file format option in the form
##                of OPTION or OPTION=VALUE
##   -T, --reference FILE
##                Reference sequence FASTA FILE [null]
##   -@, --threads INT
##                Number of additional threads to use [0]
##       --write-index
##                Automatically index the output files [off]
##       --verbosity INT
##                Set level of verbosity
## 
## Notes:
## 
## 1. This command now auto-detects the input format (BAM/CRAM/SAM).
##    Further control over the CRAM format can be specified by using the
##    --output-fmt-option, e.g. to specify the number of sequences per slice
##    and to use avoid reference based compression:
## 
##  samtools view -C --output-fmt-option seqs_per_slice=5000 \
##     --output-fmt-option no_ref -o out.cram in.bam
## 
##    Options can also be specified as a comma separated list within the
##    --output-fmt value too.  For example this is equivalent to the above
## 
##  samtools view --output-fmt cram,seqs_per_slice=5000,no_ref \
##     -o out.cram in.bam
## 
## 2. The file supplied with `-t' is SPACE/TAB delimited with the first
##    two fields of each line consisting of the reference name and the
##    corresponding sequence length. The `.fai' file generated by 
##    `samtools faidx' is suitable for use as this file. This may be an
##    empty file if reads are unaligned.
## 
## 3. SAM->BAM conversion:  samtools view -bT ref.fa in.sam.gz
## 
## 4. BAM->SAM conversion:  samtools view -h in.bam
## 
## 5. A region should be presented in one of the following formats:
##    `chr1', `chr2:1,000' and `chr3:1000-2,000'. When a region is
##    specified, the input alignment file must be a sorted and indexed
##    alignment (BAM/CRAM) file.
## 
## 6. Option `-u' is preferred over `-b' when the output is piped to
##    another samtools command.
## 
## 7. Option `-M`/`--use-index` causes overlaps with `-L` BED file regions and
##    command-line region arguments to be computed using the multi-region iterator
##    and an index. This increases speed, omits duplicates, and outputs the reads
##    as they are ordered in the input SAM/BAM/CRAM file.
## 
## 8. Options `-L`/`--target[s]-file` and `--region[s]-file` may not be used
##    together. `--region[s]-file FILE` is simply equivalent to `-M -L FILE`,
##    so using both causes one of the specified BED files to be ignored.

As you can see, there are a lot of options. These allow the user to perform a huge variety of filtering and file manipulation. These options become especially powerful when combined with other samtools commands. Rather than run through an exhaustive tour of the view command, we will instead demonstrate the most typical use cases.

To filter alignments, we can use the -f and -F options. The first outputs alignments with a given FLAG value, and the second outputs alignments without a given FLAG value. This can be quite counter-intuitive, so we should look at some examples:

Output alignments where the read is unmapped:

bash
# UNMAP (4)
# With the UNMAP flag (-f 4)
samtools view -f 4 tutorial/SRR1972739.bam > tutorial/unmapped.sam

Output alignments where the read is mapped (not unmapped):

bash
# UNMAP (4)
# Without the UNMAP flag (-F 4)
samtools view -b -F 4 tutorial/SRR1972739.bam > tutorial/mapped.bam

Remember that FLAG values can be summed together to represent a combination of FLAGs.
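Because each FLAG is a distinct power of two, a summed value can be decoded again by testing each bit. The sketch below covers only a subset of the FLAGs defined in the SAM specification; the flags command performs the same conversion (for example, samtools flags UNMAP,READ1 reports 68):

```shell
# Decode a combined FLAG value into component names by testing each bit.
# Only a subset of the SAM FLAGs is listed here for brevity.
decode_flag() {
  flag=$1
  names=""
  [ $(( flag & 1 ))   -ne 0 ] && names="$names PAIRED"
  [ $(( flag & 2 ))   -ne 0 ] && names="$names PROPER_PAIR"
  [ $(( flag & 4 ))   -ne 0 ] && names="$names UNMAP"
  [ $(( flag & 16 ))  -ne 0 ] && names="$names REVERSE"
  [ $(( flag & 64 ))  -ne 0 ] && names="$names READ1"
  [ $(( flag & 128 )) -ne 0 ] && names="$names READ2"
  echo "$flag:$names"
}
decode_flag 83   # 1 + 2 + 16 + 64
## 83: PAIRED PROPER_PAIR REVERSE READ1
decode_flag 68   # 4 + 64
## 68: UNMAP READ1
```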

Output alignments where the read is unmapped and is the first read of the pair:

bash
# UNMAP (4)
# READ1 (64)
# UNMAP,READ1 (4 + 64 = 68)
# With the UNMAP,READ1 flag combination (-f 68)
samtools view -b -f 68 tutorial/SRR1972739.bam > tutorial/unmapped_read1.bam

Output alignments where the read is mapped (not unmapped) and passed quality control (did not fail quality control):

bash
# UNMAP (4)
# QCFAIL (512)
# UNMAP,QCFAIL (4 + 512 = 516)
# Without the UNMAP,QCFAIL flag combination (-F 516)
samtools view -b -F 516 tutorial/SRR1972739.bam > tutorial/mapped_qc.bam
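Under the hood, -f and -F are bitwise tests on the FLAG field: -f N keeps a record when all the bits of N are set ((FLAG & N) == N), while -F N keeps it when none are set ((FLAG & N) == 0). A self-contained sketch of that logic in shell arithmetic, using made-up FLAG values:

```shell
# Emulate `-f 4` / `-F 4` (the UNMAP bit) on a list of FLAG values.
for flag in 83 99 77 141; do
  if [ $(( flag & 4 )) -ne 0 ]; then
    echo "FLAG $flag: selected by -f 4 (unmapped)"
  else
    echo "FLAG $flag: selected by -F 4 (mapped)"
  fi
done
## FLAG 83: selected by -F 4 (mapped)
## FLAG 99: selected by -F 4 (mapped)
## FLAG 77: selected by -f 4 (unmapped)
## FLAG 141: selected by -f 4 (unmapped)
```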

To perform more specific filtering, the two options can be combined to select some FLAG values while excluding others. For example, to output alignments on the reverse strand we need to identify alignments without the UNMAP FLAG and with the REVERSE FLAG:

bash
samtools view -b -F 4 -f 16 tutorial/SRR1972739.bam > tutorial/mapped_reverse.bam

Do not be discouraged if you find the FLAG values and the select/exclude system of alignment output difficult to remember. Bioinformaticians the world over look this information up on a daily basis, and there is no shame in searching online for a solution or reference. Aside from using FLAG values to output specific alignments, we can also filter on mapping quality. The -q option outputs alignments with mapping quality greater than or equal to a desired value:

bash
# mapping quality >= 10
samtools view -b -q 10 tutorial/SRR1972739.bam > tutorial/SRR1972739.mapq.bam

Lastly, we can also output alignments from a specific region of the reference sequence. This is useful when you want to focus on alignments covering a particular region of the genome, such as a specific gene. To output alignments for a given region, simply specify the region on the command line in the required format (NAME:START-END). Note that this requires a sorted and indexed BAM file.

bash
# NAME = AF086833.2
# START = 1
# END = 1000
samtools view -b tutorial/SRR1972739.bam AF086833.2:1-1000 > tutorial/SRR1972739.region.bam

Finally, Samtools also provides a text-based genome browser to visually inspect the alignments:

bash
# display as text (-d T)
# go to this position (-p AF086833.2:9972)
samtools tview -d T -p AF086833.2:9972 tutorial/SRR1972739.bam tutorial/AF086833.fasta
##          9981      9991      10001     10011     10021     10031     10041      
## AGTACTATTTCAGGGTAGTCCAATTAGTGGCACGTCTTTTAGCTGTATATCAGTCGCCCCTGAGATACGCCACAAAAGTG
## .A............A.......GC.....A...................C....T.........................
## ,a,,,,,,,,              .....A...................C....T.........................
## .A............A.......  ,,,,,a,,,,,,,,,,,,,,,,,,,c,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,
## .A............A.......GC.....A..................                                
## .A............A.......GC.....A...................C....T....                     
## .A............A.......GC.....A...................C....T.....................    
## .A............A.......GC.....A.................                                 
## .A............A.......GC.....A...................C....T.........................
## .A............A.......GC.....A...................C....T.......................  
## ,a,,,,,,,,,,,,a,,,,,,,gc,,,,,a,,,,,,,,,,,,,,,,,,,c,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,
##  A............A.......GC.....A...................C....T.........................
##  A............A.......GC.....A...................C....T.........................
##        .......A.......GC.....A...................C....T.........................
##                           ,,,a,,,,,,,,,,,,,,,,,,,c,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,
##                                           .......C....T.........................
##                                             .....C....T.........................

5 Exercise

The exercises below are designed to strengthen your knowledge of using Samtools to manipulate SAM files. The solution to each problem is blurred; only look at it after attempting to solve the problem yourself. Should you need any help, please ask one of the instructors.

Create a directory to store the output files from each exercise:

bash
mkdir exercises
mkdir exercises/cholera
mkdir exercises/zika
mkdir exercises/tuberculosis

5.1 Cholera

Cholera is an acute diarrhoeal infection caused by eating or drinking food or water that is contaminated with the bacterium Vibrio cholerae. Cholera remains a global threat to public health and is an indicator of inequity and lack of social development. Researchers have estimated that every year, there are 1.3 to 4.0 million cases of cholera, and 21,000 to 143,000 deaths worldwide due to the infection.
World Health Organization
  1. Use the efetch command to download the genome of the Vibrio cholerae bacterium. Note that the genome is split into two entries in the NCBI database: AP014524 and AP014525. You can either download each separately and then concatenate the files, or provide both accession numbers to the efetch command.
bash
efetch -db nuccore -id "AP014524,AP014525" -format fasta > exercises/cholera/MS6.fasta
  2. Use the fastq-dump command to download 1000 single-end reads from the public Vibrio cholerae sequencing run with accession number SRR16345277.
bash
fastq-dump -X 1000 -O exercises/cholera SRR16345277
## Read 1000 spots for SRR16345277
## Written 1000 spots for SRR16345277
  3. Index the bacterial genome using BWA.
bash
bwa index exercises/cholera/MS6.fasta
## [bwa_index] Pack FASTA... 0.03 sec
## [bwa_index] Construct BWT for the packed sequence...
## [bwa_index] 0.53 seconds elapse.
## [bwa_index] Update BWT... 0.02 sec
## [bwa_index] Pack forward-only FASTA... 0.01 sec
## [bwa_index] Construct SA from BWT and Occ... 0.17 sec
## [main] Version: 0.7.17-r1188
## [main] CMD: bwa index exercises/cholera/MS6.fasta
## [main] Real time: 0.768 sec; CPU: 0.783 sec
  4. Align the sequencing reads to the bacterial genome using BWA.
bash
bwa mem exercises/cholera/MS6.fasta exercises/cholera/SRR16345277.fastq > exercises/cholera/SRR16345277.sam
## [M::bwa_idx_load_from_disk] read 0 ALT contigs
## [M::process] read 1000 sequences (100000 bp)...
## [M::mem_process_seqs] Processed 1000 reads in 0.023 CPU sec, 0.023 real sec
## [main] Version: 0.7.17-r1188
## [main] CMD: bwa mem exercises/cholera/MS6.fasta exercises/cholera/SRR16345277.fastq
## [main] Real time: 0.028 sec; CPU: 0.030 sec
  5. Use Samtools to sort and convert the SAM file to BAM format.
bash
samtools sort exercises/cholera/SRR16345277.sam > exercises/cholera/SRR16345277.bam
  6. Index the sorted BAM file.
bash
samtools index exercises/cholera/SRR16345277.bam
  7. Calculate flag alignment statistics from the BAM file.
bash
samtools flagstat exercises/cholera/SRR16345277.bam
## 1006 + 0 in total (QC-passed reads + QC-failed reads)
## 1000 + 0 primary
## 0 + 0 secondary
## 6 + 0 supplementary
## 0 + 0 duplicates
## 0 + 0 primary duplicates
## 975 + 0 mapped (96.92% : N/A)
## 969 + 0 primary mapped (96.90% : N/A)
## 0 + 0 paired in sequencing
## 0 + 0 read1
## 0 + 0 read2
## 0 + 0 properly paired (N/A : N/A)
## 0 + 0 with itself and mate mapped
## 0 + 0 singletons (N/A : N/A)
## 0 + 0 with mate mapped to a different chr
## 0 + 0 with mate mapped to a different chr (mapQ>=5)
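The percentages reported by flagstat are simple ratios of the counts above them; for example, the 96.90% primary mapped figure is 969 primary mapped reads out of 1000 primary reads. They can be reproduced with one-line awk calculations:

```bash
# Reproduce the flagstat percentages by hand:
# primary mapped / primary reads, and all mapped / total records.
awk 'BEGIN { printf "%.2f%%\n", 969 / 1000 * 100 }'   # prints 96.90%
awk 'BEGIN { printf "%.2f%%\n", 975 / 1006 * 100 }'   # prints 96.92%
```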
  8. Calculate index alignment statistics from the BAM file.
bash
samtools idxstats exercises/cholera/SRR16345277.bam
## AP014524.1   2936971 714 0
## AP014525.1   1093973 261 0
## *    0   0   31
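Each idxstats row gives the reference name, reference length, mapped read count, and unmapped read count, with a final `*` row for unplaced reads. The columns can be totalled with awk; a small sketch using the output above as a here-document:

```bash
# Sum the mapped (column 3) and unmapped (column 4) counts across
# all rows of idxstats-style output, including the unplaced "*" row.
awk '{ mapped += $3; unmapped += $4 }
     END { print "mapped=" mapped, "unmapped=" unmapped }' <<'EOF'
AP014524.1 2936971 714 0
AP014525.1 1093973 261 0
* 0 0 31
EOF
```

This prints `mapped=975 unmapped=31`, matching the 975 mapped reads reported by flagstat.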
  9. Filter the BAM file to contain only MAPPED reads.
bash
# UNMAP (4)
samtools view -b -F 4 exercises/cholera/SRR16345277.bam > exercises/cholera/SRR16345277.mapped.bam
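SAM FLAG values are bit masks, so the `-f`/`-F` filters are bitwise AND tests. The same test can be sketched in plain Bash arithmetic (the flag value 99 here is just an illustrative example, not taken from this alignment):

```bash
# Decode a SAM FLAG value with bitwise AND, as samtools view -f/-F does.
FLAG=99   # hypothetical flag: paired + proper pair + mate reverse + first in pair

(( FLAG & 4 )) && echo "UNMAP: read is unmapped" || echo "read is mapped"
(( FLAG & 2 )) && echo "PROPER_PAIR: yes"        || echo "PROPER_PAIR: no"
```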

5.2 Zika

Zika virus is primarily transmitted by the bite of an infected mosquito from the Aedes genus, mainly Aedes aegypti, in tropical and subtropical regions. Aedes mosquitoes usually bite during the day, peaking during early morning and late afternoon/evening. This is the same mosquito that transmits dengue, chikungunya and yellow fever.
World Health Organization
  1. Use the efetch command from EDirect to download and save the Zika reference genome (AY632535).
bash
efetch -db nuccore -id AY632535 -format fasta > exercises/zika/AY632535.fasta
  2. Use the esearch and efetch commands to download and save the run information from a public Zika sequencing project (PRJNA609110).
bash
esearch -db sra -query PRJNA609110 | efetch -format runinfo > exercises/zika/runinfo.csv
  3. Create a file which contains the run accession numbers of those runs which are single-end.
bash
grep "SINGLE" exercises/zika/runinfo.csv | cut -d "," -f 1 > exercises/zika/runids.txt
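Note that `grep "SINGLE"` matches anywhere on the line, so it would also catch a run whose sample name happened to contain that text. A slightly more defensive sketch tests the LibraryLayout field itself, shown here against a mock three-column runinfo file (in a real runinfo.csv the LibraryLayout column sits at a different index, so the field number would need adjusting):

```bash
# Mock runinfo with Run,LibraryLayout,SampleName columns; select single-end
# runs by testing the LibraryLayout field rather than grepping the whole line.
printf 'Run,LibraryLayout,SampleName\nSRR000001,SINGLE,sampleA\nSRR000002,PAIRED,sampleB\n' > runinfo_demo.csv

awk -F "," '$2 == "SINGLE" { print $1 }' runinfo_demo.csv   # prints SRR000001
```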
  4. Use the fastq-dump command to download 10,000 single-end reads from each of the runs selected in Q3. Instead of manually creating the command for each run accession number, use a for loop to read each run accession number from the file created in Q3.
bash
while read RUN; do

  fastq-dump -X 10000 -O exercises/zika ${RUN}

done < exercises/zika/runids.txt
## Read 10000 spots for SRR14467421
## Written 10000 spots for SRR14467421
## Read 10000 spots for SRR14467422
## Written 10000 spots for SRR14467422
  5. Use the BWA aligner to index the Zika genome:
bash
bwa index exercises/zika/AY632535.fasta
## [bwa_index] Pack FASTA... 0.00 sec
## [bwa_index] Construct BWT for the packed sequence...
## [bwa_index] 0.00 seconds elapse.
## [bwa_index] Update BWT... 0.00 sec
## [bwa_index] Pack forward-only FASTA... 0.00 sec
## [bwa_index] Construct SA from BWT and Occ... 0.00 sec
## [main] Version: 0.7.17-r1188
## [main] CMD: bwa index exercises/zika/AY632535.fasta
## [main] Real time: 0.004 sec; CPU: 0.021 sec
  6. Align the sequencing reads to the Zika genome using the BWA aligner. Again, use a for loop to construct and run each of the commands.
bash
while read RUN; do

  bwa mem exercises/zika/AY632535.fasta exercises/zika/${RUN}.fastq > exercises/zika/${RUN}.sam

done < exercises/zika/runids.txt
## [M::bwa_idx_load_from_disk] read 0 ALT contigs
## [M::process] read 10000 sequences (1480034 bp)...
## [M::mem_process_seqs] Processed 10000 reads in 0.585 CPU sec, 0.586 real sec
## [main] Version: 0.7.17-r1188
## [main] CMD: bwa mem exercises/zika/AY632535.fasta exercises/zika/SRR14467421.fastq
## [main] Real time: 0.608 sec; CPU: 0.605 sec
## [M::bwa_idx_load_from_disk] read 0 ALT contigs
## [M::process] read 10000 sequences (1461417 bp)...
## [M::mem_process_seqs] Processed 10000 reads in 0.593 CPU sec, 0.593 real sec
## [main] Version: 0.7.17-r1188
## [main] CMD: bwa mem exercises/zika/AY632535.fasta exercises/zika/SRR14467422.fastq
## [main] Real time: 0.609 sec; CPU: 0.611 sec
  7. Sort and convert the SAM files to BAM format. Use a loop as before.
bash
while read RUN; do

  samtools sort exercises/zika/${RUN}.sam > exercises/zika/${RUN}.bam

done < exercises/zika/runids.txt
  8. Index the sorted BAM files. Loop again.
bash
while read RUN; do

  samtools index exercises/zika/${RUN}.bam

done < exercises/zika/runids.txt
  9. Use the merge command to combine the alignments from all of the BAM files into a single BAM file.
bash
samtools merge exercises/zika/merge.bam exercises/zika/SRR*.bam
  10. Calculate the read coverage from the merged BAM file and report the percent covered and mean coverage values.
bash
# Percent covered: 5.151%
# Mean coverage: 238x
samtools coverage -m exercises/zika/merge.bam
## AY632535.2 (10.8Kbp)
## >  90.00% │█                                       │ Number of reads: 23138
## >  80.00% │█                                       │ 
## >  70.00% │█                                       │ Covered bases:   556bp
## >  60.00% │█                                       │ Percent covered: 5.151%
## >  50.00% │██                                      │ Mean coverage:   238x
## >  40.00% │██                                      │ Mean baseQ:      34.1
## >  30.00% │██                                      │ Mean mapQ:       56.7
## >  20.00% │██           ▃                          │ 
## >  10.00% │██           █                   ▄      │ Histo bin width: 269bp
## >   0.00% │██           █                  ▆█      │ Histo max bin:   100%
##           1        2.7K      5.4K      8.1K      10.8K

5.3 Tuberculosis

Tuberculosis (TB) is caused by bacteria (Mycobacterium tuberculosis) and it most often affects the lungs. TB is spread through the air when people with lung TB cough, sneeze or spit. A person needs to inhale only a few germs to become infected. Every year, 10 million people fall ill with tuberculosis (TB). Despite being a preventable and curable disease, 1.5 million people die from TB each year – making it the world’s top infectious killer.
World Health Organization
  1. Use the efetch command from EDirect to download and save the Tuberculosis reference genome (AL123456).
bash
efetch -db nuccore -id AL123456 -format fasta > exercises/tuberculosis/AL123456.fasta
  2. Use the esearch and efetch commands to download and save the run information from a public Tuberculosis sequencing project (PRJNA575883).
bash
esearch -db sra -query PRJNA575883 | efetch -format runinfo > exercises/tuberculosis/runinfo.csv
  3. Create a file which contains the run accession numbers of those runs from submission SRA1348739.
bash
grep "SRA1348739" exercises/tuberculosis/runinfo.csv | cut -d "," -f 1 > exercises/tuberculosis/runids.txt
  4. Use the fastq-dump command to download 1,000 paired-end reads from each of the runs selected in Q3. Instead of manually creating the command for each run accession number, use a for loop to read each run accession number from the file created in Q3.
bash
while read RUN; do

  fastq-dump -X 1000 -O exercises/tuberculosis --split-files ${RUN}

done < exercises/tuberculosis/runids.txt
## Read 1000 spots for SRR17333997
## Written 1000 spots for SRR17333997
## Read 1000 spots for SRR17333996
## Written 1000 spots for SRR17333996
## Read 1000 spots for SRR17333995
## Written 1000 spots for SRR17333995
## Read 1000 spots for SRR17333994
## Written 1000 spots for SRR17333994
## Read 1000 spots for SRR17333993
## Written 1000 spots for SRR17333993
## Read 1000 spots for SRR17333992
## Written 1000 spots for SRR17333992
## Read 1000 spots for SRR17333991
## Written 1000 spots for SRR17333991
## Read 1000 spots for SRR17333990
## Written 1000 spots for SRR17333990
## Read 1000 spots for SRR17333989
## Written 1000 spots for SRR17333989
## Read 1000 spots for SRR17333998
## Written 1000 spots for SRR17333998
  5. Use the Bowtie 2 aligner to index the Tuberculosis genome.
bash
bowtie2-build exercises/tuberculosis/AL123456.fasta exercises/tuberculosis/AL123456
## Settings:
##   Output files: "exercises/tuberculosis/AL123456.*.bt2"
##   Line rate: 6 (line is 64 bytes)
##   Lines per side: 1 (side is 64 bytes)
##   Offset rate: 4 (one in 16)
##   FTable chars: 10
##   Strings: unpacked
##   Max bucket size: default
##   Max bucket size, sqrt multiplier: default
##   Max bucket size, len divisor: 4
##   Difference-cover sample period: 1024
##   Endianness: little
##   Actual local endianness: little
##   Sanity checking: disabled
##   Assertions: disabled
##   Random seed: 0
##   Sizeofs: void*:8, int:4, long:8, size_t:8
## Input files DNA, FASTA:
##   exercises/tuberculosis/AL123456.fasta
## Building a SMALL index
## Reading reference sizes
##   Time reading reference sizes: 00:00:00
## Calculating joined length
## Writing header
## Reserving space for joined string
## Joining reference sequences
##   Time to join reference sequences: 00:00:00
## bmax according to bmaxDivN setting: 1102883
## Using parameters --bmax 827163 --dcv 1024
##   Doing ahead-of-time memory usage test
##   Passed!  Constructing with these parameters: --bmax 827163 --dcv 1024
## Constructing suffix-array element generator
## Building DifferenceCoverSample
##   Building sPrime
##   Building sPrimeOrder
##   V-Sorting samples
##   V-Sorting samples time: 00:00:00
##   Allocating rank array
##   Ranking v-sort output
##   Ranking v-sort output time: 00:00:00
##   Invoking Larsson-Sadakane on ranks
##   Invoking Larsson-Sadakane on ranks time: 00:00:00
##   Sanity-checking and returning
## Building samples
## Reserving space for 12 sample suffixes
## Generating random suffixes
## QSorting 12 sample offsets, eliminating duplicates
## QSorting sample offsets, eliminating duplicates time: 00:00:00
## Multikey QSorting 12 samples
##   (Using difference cover)
##   Multikey QSorting samples time: 00:00:00
## Calculating bucket sizes
## Splitting and merging
##   Splitting and merging time: 00:00:00
## Split 1, merged 6; iterating...
## Splitting and merging
##   Splitting and merging time: 00:00:00
## Split 1, merged 1; iterating...
## Splitting and merging
##   Splitting and merging time: 00:00:00
## Split 1, merged 1; iterating...
## Splitting and merging
##   Splitting and merging time: 00:00:00
## Avg bucket size: 630218 (target: 827162)
## Converting suffix-array elements to index image
## Allocating ftab, absorbFtab
## Entering Ebwt loop
## Getting block 1 of 7
##   Reserving size (827163) for bucket 1
##   Calculating Z arrays for bucket 1
##   Entering block accumulator loop for bucket 1:
##   bucket 1: 10%
##   bucket 1: 20%
##   bucket 1: 30%
##   bucket 1: 40%
##   bucket 1: 50%
##   bucket 1: 60%
##   bucket 1: 70%
##   bucket 1: 80%
##   bucket 1: 90%
##   bucket 1: 100%
##   Sorting block of length 645384 for bucket 1
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 645385 for bucket 1
## Getting block 2 of 7
##   Reserving size (827163) for bucket 2
##   Calculating Z arrays for bucket 2
##   Entering block accumulator loop for bucket 2:
##   bucket 2: 10%
##   bucket 2: 20%
##   bucket 2: 30%
##   bucket 2: 40%
##   bucket 2: 50%
##   bucket 2: 60%
##   bucket 2: 70%
##   bucket 2: 80%
##   bucket 2: 90%
##   bucket 2: 100%
##   Sorting block of length 715092 for bucket 2
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 715093 for bucket 2
## Getting block 3 of 7
##   Reserving size (827163) for bucket 3
##   Calculating Z arrays for bucket 3
##   Entering block accumulator loop for bucket 3:
##   bucket 3: 10%
##   bucket 3: 20%
##   bucket 3: 30%
##   bucket 3: 40%
##   bucket 3: 50%
##   bucket 3: 60%
##   bucket 3: 70%
##   bucket 3: 80%
##   bucket 3: 90%
##   bucket 3: 100%
##   Sorting block of length 484041 for bucket 3
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 484042 for bucket 3
## Getting block 4 of 7
##   Reserving size (827163) for bucket 4
##   Calculating Z arrays for bucket 4
##   Entering block accumulator loop for bucket 4:
##   bucket 4: 10%
##   bucket 4: 20%
##   bucket 4: 30%
##   bucket 4: 40%
##   bucket 4: 50%
##   bucket 4: 60%
##   bucket 4: 70%
##   bucket 4: 80%
##   bucket 4: 90%
##   bucket 4: 100%
##   Sorting block of length 491241 for bucket 4
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 491242 for bucket 4
## Getting block 5 of 7
##   Reserving size (827163) for bucket 5
##   Calculating Z arrays for bucket 5
##   Entering block accumulator loop for bucket 5:
##   bucket 5: 10%
##   bucket 5: 20%
##   bucket 5: 30%
##   bucket 5: 40%
##   bucket 5: 50%
##   bucket 5: 60%
##   bucket 5: 70%
##   bucket 5: 80%
##   bucket 5: 90%
##   bucket 5: 100%
##   Sorting block of length 776428 for bucket 5
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 776429 for bucket 5
## Getting block 6 of 7
##   Reserving size (827163) for bucket 6
##   Calculating Z arrays for bucket 6
##   Entering block accumulator loop for bucket 6:
##   bucket 6: 10%
##   bucket 6: 20%
##   bucket 6: 30%
##   bucket 6: 40%
##   bucket 6: 50%
##   bucket 6: 60%
##   bucket 6: 70%
##   bucket 6: 80%
##   bucket 6: 90%
##   bucket 6: 100%
##   Sorting block of length 776107 for bucket 6
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 776108 for bucket 6
## Getting block 7 of 7
##   Reserving size (827163) for bucket 7
##   Calculating Z arrays for bucket 7
##   Entering block accumulator loop for bucket 7:
##   bucket 7: 10%
##   bucket 7: 20%
##   bucket 7: 30%
##   bucket 7: 40%
##   bucket 7: 50%
##   bucket 7: 60%
##   bucket 7: 70%
##   bucket 7: 80%
##   bucket 7: 90%
##   bucket 7: 100%
##   Sorting block of length 523233 for bucket 7
##   (Using difference cover)
##   Sorting block time: 00:00:01
## Returning block of 523234 for bucket 7
## Exited Ebwt loop
## fchr[A]: 0
## fchr[C]: 758552
## fchr[G]: 2208550
## fchr[T]: 3653164
## fchr[$]: 4411532
## Exiting Ebwt::buildToDisk()
## Returning from initFromVector
## Wrote 5665053 bytes to primary EBWT file: exercises/tuberculosis/AL123456.1.bt2.tmp
## Wrote 1102888 bytes to secondary EBWT file: exercises/tuberculosis/AL123456.2.bt2.tmp
## Re-opening _in1 and _in2 as input streams
## Returning from Ebwt constructor
## Headers:
##     len: 4411532
##     bwtLen: 4411533
##     sz: 1102883
##     bwtSz: 1102884
##     lineRate: 6
##     offRate: 4
##     offMask: 0xfffffff0
##     ftabChars: 10
##     eftabLen: 20
##     eftabSz: 80
##     ftabLen: 1048577
##     ftabSz: 4194308
##     offsLen: 275721
##     offsSz: 1102884
##     lineSz: 64
##     sideSz: 64
##     sideBwtSz: 48
##     sideBwtLen: 192
##     numSides: 22977
##     numLines: 22977
##     ebwtTotLen: 1470528
##     ebwtTotSz: 1470528
##     color: 0
##     reverse: 0
## Total time for call to driver() for forward index: 00:00:02
## Reading reference sizes
##   Time reading reference sizes: 00:00:00
## Calculating joined length
## Writing header
## Reserving space for joined string
## Joining reference sequences
##   Time to join reference sequences: 00:00:00
##   Time to reverse reference sequence: 00:00:00
## bmax according to bmaxDivN setting: 1102883
## Using parameters --bmax 827163 --dcv 1024
##   Doing ahead-of-time memory usage test
##   Passed!  Constructing with these parameters: --bmax 827163 --dcv 1024
## Constructing suffix-array element generator
## Building DifferenceCoverSample
##   Building sPrime
##   Building sPrimeOrder
##   V-Sorting samples
##   V-Sorting samples time: 00:00:00
##   Allocating rank array
##   Ranking v-sort output
##   Ranking v-sort output time: 00:00:00
##   Invoking Larsson-Sadakane on ranks
##   Invoking Larsson-Sadakane on ranks time: 00:00:00
##   Sanity-checking and returning
## Building samples
## Reserving space for 12 sample suffixes
## Generating random suffixes
## QSorting 12 sample offsets, eliminating duplicates
## QSorting sample offsets, eliminating duplicates time: 00:00:00
## Multikey QSorting 12 samples
##   (Using difference cover)
##   Multikey QSorting samples time: 00:00:00
## Calculating bucket sizes
## Splitting and merging
##   Splitting and merging time: 00:00:00
## Split 1, merged 5; iterating...
## Splitting and merging
##   Splitting and merging time: 00:00:00
## Avg bucket size: 551441 (target: 827162)
## Converting suffix-array elements to index image
## Allocating ftab, absorbFtab
## Entering Ebwt loop
## Getting block 1 of 8
##   Reserving size (827163) for bucket 1
##   Calculating Z arrays for bucket 1
##   Entering block accumulator loop for bucket 1:
##   bucket 1: 10%
##   bucket 1: 20%
##   bucket 1: 30%
##   bucket 1: 40%
##   bucket 1: 50%
##   bucket 1: 60%
##   bucket 1: 70%
##   bucket 1: 80%
##   bucket 1: 90%
##   bucket 1: 100%
##   Sorting block of length 523071 for bucket 1
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 523072 for bucket 1
## Getting block 2 of 8
##   Reserving size (827163) for bucket 2
##   Calculating Z arrays for bucket 2
##   Entering block accumulator loop for bucket 2:
##   bucket 2: 10%
##   bucket 2: 20%
##   bucket 2: 30%
##   bucket 2: 40%
##   bucket 2: 50%
##   bucket 2: 60%
##   bucket 2: 70%
##   bucket 2: 80%
##   bucket 2: 90%
##   bucket 2: 100%
##   Sorting block of length 549662 for bucket 2
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 549663 for bucket 2
## Getting block 3 of 8
##   Reserving size (827163) for bucket 3
##   Calculating Z arrays for bucket 3
##   Entering block accumulator loop for bucket 3:
##   bucket 3: 10%
##   bucket 3: 20%
##   bucket 3: 30%
##   bucket 3: 40%
##   bucket 3: 50%
##   bucket 3: 60%
##   bucket 3: 70%
##   bucket 3: 80%
##   bucket 3: 90%
##   bucket 3: 100%
##   Sorting block of length 532754 for bucket 3
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 532755 for bucket 3
## Getting block 4 of 8
##   Reserving size (827163) for bucket 4
##   Calculating Z arrays for bucket 4
##   Entering block accumulator loop for bucket 4:
##   bucket 4: 10%
##   bucket 4: 20%
##   bucket 4: 30%
##   bucket 4: 40%
##   bucket 4: 50%
##   bucket 4: 60%
##   bucket 4: 70%
##   bucket 4: 80%
##   bucket 4: 90%
##   bucket 4: 100%
##   Sorting block of length 660062 for bucket 4
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 660063 for bucket 4
## Getting block 5 of 8
##   Reserving size (827163) for bucket 5
##   Calculating Z arrays for bucket 5
##   Entering block accumulator loop for bucket 5:
##   bucket 5: 10%
##   bucket 5: 20%
##   bucket 5: 30%
##   bucket 5: 40%
##   bucket 5: 50%
##   bucket 5: 60%
##   bucket 5: 70%
##   bucket 5: 80%
##   bucket 5: 90%
##   bucket 5: 100%
##   Sorting block of length 692173 for bucket 5
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 692174 for bucket 5
## Getting block 6 of 8
##   Reserving size (827163) for bucket 6
##   Calculating Z arrays for bucket 6
##   Entering block accumulator loop for bucket 6:
##   bucket 6: 10%
##   bucket 6: 20%
##   bucket 6: 30%
##   bucket 6: 40%
##   bucket 6: 50%
##   bucket 6: 60%
##   bucket 6: 70%
##   bucket 6: 80%
##   bucket 6: 90%
##   bucket 6: 100%
##   Sorting block of length 618745 for bucket 6
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 618746 for bucket 6
## Getting block 7 of 8
##   Reserving size (827163) for bucket 7
##   Calculating Z arrays for bucket 7
##   Entering block accumulator loop for bucket 7:
##   bucket 7: 10%
##   bucket 7: 20%
##   bucket 7: 30%
##   bucket 7: 40%
##   bucket 7: 50%
##   bucket 7: 60%
##   bucket 7: 70%
##   bucket 7: 80%
##   bucket 7: 90%
##   bucket 7: 100%
##   Sorting block of length 702923 for bucket 7
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 702924 for bucket 7
## Getting block 8 of 8
##   Reserving size (827163) for bucket 8
##   Calculating Z arrays for bucket 8
##   Entering block accumulator loop for bucket 8:
##   bucket 8: 10%
##   bucket 8: 20%
##   bucket 8: 30%
##   bucket 8: 40%
##   bucket 8: 50%
##   bucket 8: 60%
##   bucket 8: 70%
##   bucket 8: 80%
##   bucket 8: 90%
##   bucket 8: 100%
##   Sorting block of length 132135 for bucket 8
##   (Using difference cover)
##   Sorting block time: 00:00:00
## Returning block of 132136 for bucket 8
## Exited Ebwt loop
## fchr[A]: 0
## fchr[C]: 758552
## fchr[G]: 2208550
## fchr[T]: 3653164
## fchr[$]: 4411532
## Exiting Ebwt::buildToDisk()
## Returning from initFromVector
## Wrote 5665053 bytes to primary EBWT file: exercises/tuberculosis/AL123456.rev.1.bt2.tmp
## Wrote 1102888 bytes to secondary EBWT file: exercises/tuberculosis/AL123456.rev.2.bt2.tmp
## Re-opening _in1 and _in2 as input streams
## Returning from Ebwt constructor
## Headers:
##     len: 4411532
##     bwtLen: 4411533
##     sz: 1102883
##     bwtSz: 1102884
##     lineRate: 6
##     offRate: 4
##     offMask: 0xfffffff0
##     ftabChars: 10
##     eftabLen: 20
##     eftabSz: 80
##     ftabLen: 1048577
##     ftabSz: 4194308
##     offsLen: 275721
##     offsSz: 1102884
##     lineSz: 64
##     sideSz: 64
##     sideBwtSz: 48
##     sideBwtLen: 192
##     numSides: 22977
##     numLines: 22977
##     ebwtTotLen: 1470528
##     ebwtTotSz: 1470528
##     color: 0
##     reverse: 1
## Total time for backward call to driver() for mirror index: 00:00:01
## Renaming exercises/tuberculosis/AL123456.3.bt2.tmp to exercises/tuberculosis/AL123456.3.bt2
## Renaming exercises/tuberculosis/AL123456.4.bt2.tmp to exercises/tuberculosis/AL123456.4.bt2
## Renaming exercises/tuberculosis/AL123456.1.bt2.tmp to exercises/tuberculosis/AL123456.1.bt2
## Renaming exercises/tuberculosis/AL123456.2.bt2.tmp to exercises/tuberculosis/AL123456.2.bt2
## Renaming exercises/tuberculosis/AL123456.rev.1.bt2.tmp to exercises/tuberculosis/AL123456.rev.1.bt2
## Renaming exercises/tuberculosis/AL123456.rev.2.bt2.tmp to exercises/tuberculosis/AL123456.rev.2.bt2
  6. Align the sequencing reads to the Tuberculosis genome using the Bowtie 2 aligner. Additionally, sort, convert, and index the SAM files. Again, use a for loop to construct and run each of the commands.
bash
while read RUN; do

  bowtie2 -x exercises/tuberculosis/AL123456 -1 exercises/tuberculosis/${RUN}_1.fastq -2 exercises/tuberculosis/${RUN}_2.fastq > exercises/tuberculosis/${RUN}.sam
  
  samtools sort exercises/tuberculosis/${RUN}.sam > exercises/tuberculosis/${RUN}.bam
  
  samtools index exercises/tuberculosis/${RUN}.bam

done < exercises/tuberculosis/runids.txt
## 1000 reads; of these:
##   1000 (100.00%) were paired; of these:
##     157 (15.70%) aligned concordantly 0 times
##     822 (82.20%) aligned concordantly exactly 1 time
##     21 (2.10%) aligned concordantly >1 times
##     ----
##     157 pairs aligned concordantly 0 times; of these:
##       65 (41.40%) aligned discordantly 1 time
##     ----
##     92 pairs aligned 0 times concordantly or discordantly; of these:
##       184 mates make up the pairs; of these:
##         145 (78.80%) aligned 0 times
##         34 (18.48%) aligned exactly 1 time
##         5 (2.72%) aligned >1 times
## 92.75% overall alignment rate
## 1000 reads; of these:
##   1000 (100.00%) were paired; of these:
##     119 (11.90%) aligned concordantly 0 times
##     865 (86.50%) aligned concordantly exactly 1 time
##     16 (1.60%) aligned concordantly >1 times
##     ----
##     119 pairs aligned concordantly 0 times; of these:
##       43 (36.13%) aligned discordantly 1 time
##     ----
##     76 pairs aligned 0 times concordantly or discordantly; of these:
##       152 mates make up the pairs; of these:
##         111 (73.03%) aligned 0 times
##         40 (26.32%) aligned exactly 1 time
##         1 (0.66%) aligned >1 times
## 94.45% overall alignment rate
## 1000 reads; of these:
##   1000 (100.00%) were paired; of these:
##     165 (16.50%) aligned concordantly 0 times
##     822 (82.20%) aligned concordantly exactly 1 time
##     13 (1.30%) aligned concordantly >1 times
##     ----
##     165 pairs aligned concordantly 0 times; of these:
##       49 (29.70%) aligned discordantly 1 time
##     ----
##     116 pairs aligned 0 times concordantly or discordantly; of these:
##       232 mates make up the pairs; of these:
##         164 (70.69%) aligned 0 times
##         67 (28.88%) aligned exactly 1 time
##         1 (0.43%) aligned >1 times
## 91.80% overall alignment rate
## 1000 reads; of these:
##   1000 (100.00%) were paired; of these:
##     158 (15.80%) aligned concordantly 0 times
##     824 (82.40%) aligned concordantly exactly 1 time
##     18 (1.80%) aligned concordantly >1 times
##     ----
##     158 pairs aligned concordantly 0 times; of these:
##       42 (26.58%) aligned discordantly 1 time
##     ----
##     116 pairs aligned 0 times concordantly or discordantly; of these:
##       232 mates make up the pairs; of these:
##         158 (68.10%) aligned 0 times
##         73 (31.47%) aligned exactly 1 time
##         1 (0.43%) aligned >1 times
## 92.10% overall alignment rate
## 1000 reads; of these:
##   1000 (100.00%) were paired; of these:
##     159 (15.90%) aligned concordantly 0 times
##     824 (82.40%) aligned concordantly exactly 1 time
##     17 (1.70%) aligned concordantly >1 times
##     ----
##     159 pairs aligned concordantly 0 times; of these:
##       58 (36.48%) aligned discordantly 1 time
##     ----
##     101 pairs aligned 0 times concordantly or discordantly; of these:
##       202 mates make up the pairs; of these:
##         158 (78.22%) aligned 0 times
##         43 (21.29%) aligned exactly 1 time
##         1 (0.50%) aligned >1 times
## 92.10% overall alignment rate
## 1000 reads; of these:
##   1000 (100.00%) were paired; of these:
##     139 (13.90%) aligned concordantly 0 times
##     851 (85.10%) aligned concordantly exactly 1 time
##     10 (1.00%) aligned concordantly >1 times
##     ----
##     139 pairs aligned concordantly 0 times; of these:
##       41 (29.50%) aligned discordantly 1 time
##     ----
##     98 pairs aligned 0 times concordantly or discordantly; of these:
##       196 mates make up the pairs; of these:
##         136 (69.39%) aligned 0 times
##         56 (28.57%) aligned exactly 1 time
##         4 (2.04%) aligned >1 times
## 93.20% overall alignment rate
## 1000 reads; of these:
##   1000 (100.00%) were paired; of these:
##     187 (18.70%) aligned concordantly 0 times
##     807 (80.70%) aligned concordantly exactly 1 time
##     6 (0.60%) aligned concordantly >1 times
##     ----
##     187 pairs aligned concordantly 0 times; of these:
##       86 (45.99%) aligned discordantly 1 time
##     ----
##     101 pairs aligned 0 times concordantly or discordantly; of these:
##       202 mates make up the pairs; of these:
##         158 (78.22%) aligned 0 times
##         38 (18.81%) aligned exactly 1 time
##         6 (2.97%) aligned >1 times
## 92.10% overall alignment rate
## 1000 reads; of these:
##   1000 (100.00%) were paired; of these:
##     195 (19.50%) aligned concordantly 0 times
##     790 (79.00%) aligned concordantly exactly 1 time
##     15 (1.50%) aligned concordantly >1 times
##     ----
##     195 pairs aligned concordantly 0 times; of these:
##       53 (27.18%) aligned discordantly 1 time
##     ----
##     142 pairs aligned 0 times concordantly or discordantly; of these:
##       284 mates make up the pairs; of these:
##         208 (73.24%) aligned 0 times
##         75 (26.41%) aligned exactly 1 time
##         1 (0.35%) aligned >1 times
## 89.60% overall alignment rate
## 1000 reads; of these:
##   1000 (100.00%) were paired; of these:
##     215 (21.50%) aligned concordantly 0 times
##     775 (77.50%) aligned concordantly exactly 1 time
##     10 (1.00%) aligned concordantly >1 times
##     ----
##     215 pairs aligned concordantly 0 times; of these:
##       83 (38.60%) aligned discordantly 1 time
##     ----
##     132 pairs aligned 0 times concordantly or discordantly; of these:
##       264 mates make up the pairs; of these:
##         229 (86.74%) aligned 0 times
##         35 (13.26%) aligned exactly 1 time
##         0 (0.00%) aligned >1 times
## 88.55% overall alignment rate
## 1000 reads; of these:
##   1000 (100.00%) were paired; of these:
##     254 (25.40%) aligned concordantly 0 times
##     737 (73.70%) aligned concordantly exactly 1 time
##     9 (0.90%) aligned concordantly >1 times
##     ----
##     254 pairs aligned concordantly 0 times; of these:
##       131 (51.57%) aligned discordantly 1 time
##     ----
##     123 pairs aligned 0 times concordantly or discordantly; of these:
##       246 mates make up the pairs; of these:
##         227 (92.28%) aligned 0 times
##         15 (6.10%) aligned exactly 1 time
##         4 (1.63%) aligned >1 times
## 88.65% overall alignment rate
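With ten runs, comparing the overall alignment rates by eye is tedious. The rate lines can be pulled out of a saved log with awk; a sketch against a mock log file (Bowtie 2 writes these summaries to standard error, so the commands above would need `2> bowtie2.log` or similar redirection to capture them):

```bash
# Mock two bowtie2 summary lines and extract the overall alignment rates.
printf '92.75%% overall alignment rate\n94.45%% overall alignment rate\n' > bowtie2_demo.log

awk '/overall alignment rate/ { print $1 }' bowtie2_demo.log
```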
  7. Filter the BAM files so they contain only properly paired alignments. Remember to index the new BAM files as well.
bash
while read RUN; do

  # Output BAM (-b)

  # FLAGs present (PROPER_PAIR = 2)

  samtools view -b -f 2 exercises/tuberculosis/${RUN}.bam > exercises/tuberculosis/${RUN}.PROPER_PAIR.bam
  
  samtools index exercises/tuberculosis/${RUN}.PROPER_PAIR.bam

done < exercises/tuberculosis/runids.txt
  8. Calculate the read depth at each position in the genome across all BAM files and write the output to a file.
bash
samtools depth -H exercises/tuberculosis/*.PROPER_PAIR.bam > exercises/tuberculosis/depth.txt
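With `-H`, samtools depth writes a header line followed by one row per position, with one depth column per input BAM. Per-position summaries can then be computed with awk; a sketch against a mock depth file with two sample columns (the real depth.txt above has one column per run):

```bash
# Mock depth output: CHROM, POS, then one depth column per BAM file.
printf '#CHROM\tPOS\tS1\tS2\nAL123456.3\t100\t4\t6\nAL123456.3\t101\t3\t5\n' > depth_demo.txt

# Skip the header line, then report the mean depth across both samples per position.
awk 'NR > 1 { printf "%s\t%s\t%.1f\n", $1, $2, ($3 + $4) / 2 }' depth_demo.txt
```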